Movement-Specific Analysis for FIM Score Classification Using Spatio-Temporal Deep Learning
Masaki, Jun, Higashi, Ariaki, Shinagawa, Naoko, Hirata, Kazuhiko, Kurita, Yuichi, Furui, Akira
The functional independence measure (FIM) is widely used to evaluate patients' physical independence in activities of daily living. However, traditional FIM assessment imposes a significant burden on both patients and healthcare professionals. To address this challenge, we propose an automated FIM score estimation method that utilizes simple exercises different from the designated FIM assessment actions. Our approach employs a deep neural network architecture integrating a spatio-temporal graph convolutional network (ST-GCN), bidirectional long short-term memory (BiLSTM), and an attention mechanism to estimate FIM motor item scores. The model effectively captures long-term temporal dependencies and identifies key body-joint contributions through learned attention weights. We evaluated our method in a study of 277 rehabilitation patients, focusing on FIM transfer and locomotion items. Our approach successfully distinguishes between completely independent patients and those requiring assistance, achieving balanced accuracies of 70.09-78.79% across different FIM items. Additionally, our analysis reveals specific movement patterns that serve as reliable predictors for particular FIM evaluation items.
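As a minimal sketch of the attention step described above (pooling per-frame features, e.g. BiLSTM outputs, with learned weights before classification; all names and shapes here are illustrative, not taken from the paper):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(frame_feats, score_w):
    """Pool a sequence of per-frame feature vectors into one vector.

    frame_feats: list of T feature vectors (one per time step);
    score_w: a scoring vector standing in for the learned attention
    parameters. Returns the pooled feature and the per-frame weights,
    which is how the model exposes which frames (or joints) mattered.
    """
    scores = [sum(w * f for w, f in zip(score_w, feat)) for feat in frame_feats]
    weights = softmax(scores)  # positive, sums to 1 over time
    dim = len(frame_feats[0])
    pooled = [sum(weights[t] * frame_feats[t][d] for t in range(len(frame_feats)))
              for d in range(dim)]
    return pooled, weights
```

Inspecting `weights` is the analogue of reading off which movements contributed most to a given FIM item prediction.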
Occlusion-Aware Diffusion Model for Pedestrian Intention Prediction
Liu, Yu, Liu, Zhijie, Yang, Zedong, Li, You-Fu, Kong, He
Predicting pedestrian crossing intentions is crucial for the navigation of mobile robots and intelligent vehicles. Although recent deep learning-based models have shown significant success in forecasting intentions, few consider incomplete observation under occlusion scenarios. To tackle this challenge, we propose an Occlusion-Aware Diffusion Model (ODM) that reconstructs occluded motion patterns and leverages them to guide future intention prediction. During the denoising stage, we introduce an occlusion-aware diffusion transformer architecture to estimate noise features associated with occluded patterns, thereby enhancing the model's ability to capture contextual relationships in occluded semantic scenarios. Furthermore, an occlusion mask-guided reverse process is introduced to effectively utilize observation information, reducing the accumulation of prediction errors and enhancing the accuracy of reconstructed motion features. The performance of the proposed method under various occlusion scenarios is comprehensively evaluated and compared with existing methods on popular benchmarks, namely PIE and JAAD. Extensive experimental results demonstrate that the proposed method achieves more robust performance than existing methods in the literature.
With the rapid advancement of intelligent sensing and computing technologies, much progress has been made in recent years in developing autonomous vehicles to enhance traffic efficiency and road safety. To prevent collisions, path planning of autonomous vehicles [1], [2] is essential, requiring an understanding of interactions between road users and the ability to forecast their potential actions [3]-[5].
This manuscript has been accepted to the IEEE Transactions on Intelligent Transportation Systems as a regular paper. Yu Liu is also with the Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China. You-Fu Li is with the Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China.
Figure caption: The typical scenario of visual occlusion is illustrated here. Solid green lines represent the parts of the observation that are within the field of view and visible, while dashed red lines indicate positional features that are undetectable due to occlusion.
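The mask-guided reverse process can be illustrated with a one-line correction step: wherever the trajectory was actually observed, the known observation overrides the model's current estimate, so denoising errors cannot accumulate on visible frames. This is a generic sketch of that idea, not the paper's exact update rule:

```python
def mask_guided_step(x_est, observed, visible_mask):
    """One mask-guided correction inside a reverse (denoising) pass.

    x_est: the model's current per-frame estimates;
    observed: the raw observations (only meaningful where visible);
    visible_mask: True where the frame was within the field of view.
    Visible frames are clamped to their observations; occluded frames
    keep the reconstructed value.
    """
    return [obs if vis else est
            for est, obs, vis in zip(x_est, observed, visible_mask)]
```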
LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/.
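The sliding-window causal design can be sketched as incremental chunk processing that only ever sees the current chunk plus a bounded amount of past context, never future frames. Here the "model" is just an average standing in for the latent diffusion and causal decoder; the windowing logic is the point:

```python
from collections import deque

def stream_stylize(chunks, window=4):
    """Process motion chunks incrementally with a bounded causal window.

    Each output is computed from the current chunk and at most
    `window - 1` previous chunks, so latency stays constant and no
    future input is required.
    """
    history = deque(maxlen=window)  # sliding window of past context
    outputs = []
    for chunk in chunks:
        history.append(chunk)
        outputs.append(sum(history) / len(history))  # stand-in for the decoder
    return outputs
```

Causality is easy to check: changing a later chunk must not change any earlier output.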
FTIN: Frequency-Time Integration Network for Inertial Odometry
Zhang, Shanshan, Zhang, Qi, Wang, Siyue, Wu, Liqin, Wen, Tianshui, Zhou, Ziheng, Peng, Ao, Hong, Xuemin, Zheng, Lingxiang, Yang, Yu
However, high IMU sampling rates introduce substantial redundancy that impedes IO's ability to attend to salient components, thereby creating an information bottleneck. To address this challenge, we propose a cross-domain IO framework that fuses information from the frequency and time domains. Specifically, we exploit the global context and energy-compaction properties of frequency-domain representations to capture holistic motion patterns and alleviate the bottleneck. To the best of our knowledge, this is among the first attempts to incorporate frequency-domain feature processing into IO. Experimental results on multiple public datasets demonstrate the effectiveness of the proposed frequency-time-domain fusion strategy.
Index Terms -- Frequency-Domain Learning, Inertial Odometry, Inertial Measurement Unit signals
1. INTRODUCTION
Inertial odometry (IO) aims to reconstruct motion trajectories from high-frequency inertial measurement unit (IMU) signals -- comprising tri-axial accelerometer and gyroscope data -- in order to enable low-cost and robust localization [1, 2].
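A toy illustration of the frequency-time fusion idea: take a window of IMU samples, compute a compact frequency-domain descriptor (DFT magnitudes, which concentrate the signal's energy in a few coefficients), and concatenate it with simple time-domain statistics. This is a generic sketch, not the paper's actual feature extractor:

```python
import cmath
import math

def freq_time_features(signal):
    """Fuse frequency- and time-domain descriptors of one IMU window.

    Returns the non-redundant half of the DFT magnitude spectrum
    (global, energy-compact view of the motion) concatenated with the
    window mean and variance (local time-domain view).
    """
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):  # real signal: keep half-spectrum only
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(coeff) / n)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    return mags + [mean, var]
```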
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Summary: The paper presents an extension of gated autoencoders to time-series data. The main idea is to use a gated autoencoder to model the time series in an autoregressive manner; predicting x_{t+1} from x_t using a gated autoencoder whose mapping unit values are initialised using a pair of contiguous datapoints. The paper introduces two interesting refinements: predictive training, and higher order relational features. Predictive training is a training criterion suitable for time series data that is different from the criterion normally used for gated autoencoders. Predictive training tries to minimise the square error in predicting x_{t+1} given x_{t} and the value of the mapping units that optimally predict x_{t} given x_{t-1}.
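The predictive criterion summarised above can be sketched in a few lines: mapping units encode the transformation from x_{t-1} to x_t via multiplicative (gated) interactions, and that inferred transformation is then re-applied to x_t to predict x_{t+1}, with squared error as the loss. This uses scalar stand-in weights `w_enc` and `w_dec` rather than the model's full weight matrices:

```python
def gated_predict(x_prev, x_curr, w_enc, w_dec):
    """Predict x_{t+1} from (x_{t-1}, x_t) with a gated interaction.

    The mapping units m are a multiplicative product of the two frames
    (capturing the transformation between them); the prediction applies
    that transformation to the current frame.
    """
    m = [w_enc * a * b for a, b in zip(x_prev, x_curr)]   # mapping units
    return [w_dec * mi * xc for mi, xc in zip(m, x_curr)]  # re-apply to x_t

def predictive_loss(pred, x_next):
    """Squared prediction error, the quantity predictive training minimises."""
    return sum((p - t) ** 2 for p, t in zip(pred, x_next))
```

If the series is constant, the identity transformation is recoverable and the loss vanishes.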
MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
Kim, Hyung Kyu, Lee, Sangmin, Kim, Hak Gu
Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker's speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker, which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style with audio input alone, to maximize usability in applications. Our framework consists of two training stages: the first stage stores and retrieves general motion (i.e., Memorizing), and the second stage performs personalized facial motion synthesis (i.e., Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns which facial motion types should be emphasized for a particular piece of audio. As a result, our MemoryTalker can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as a user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods.
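The memory read in the first stage can be sketched as a standard key-value retrieval: score each stored motion key against the audio feature, softmax the scores, and return the weighted sum of stored motion values. All shapes and names are illustrative, not from the paper:

```python
import math

def retrieve_motion(audio_feat, keys, values):
    """Audio-guided read from a motion memory.

    Scores every memory key against the audio feature, normalizes with
    a softmax, and returns the weighted sum of the stored motion values,
    so the output is a convex combination of memorized motions.
    """
    scores = [sum(a * k for a, k in zip(audio_feat, key)) for key in keys]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    z = sum(es)
    w = [e / z for e in es]
    dim = len(values[0])
    return [sum(w[i] * values[i][d] for i in range(len(values)))
            for d in range(dim)]
```

The second-stage stylization would then modulate `values` (or the read weights) with the audio-driven style feature before this read.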
EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
Izzati, Fathinah, Li, Xinyue, Xia, Gus
We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls - specifically, human facial expressions and upper-body motion - as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation.
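The temporal smoothing strategy mentioned above can be illustrated with a centered moving average over per-frame visual control features, damping frame-to-frame jitter before the controls condition the music model. This is a simple stand-in, not the paper's exact alignment procedure:

```python
def smooth_controls(feats, kernel=3):
    """Centered moving-average smoothing of per-frame control features.

    feats: one scalar control value per video frame; kernel: odd window
    size. Edges use a shrunken window so the output length matches the
    input length, keeping video and music frames in one-to-one alignment.
    """
    half = kernel // 2
    out = []
    for i in range(len(feats)):
        lo, hi = max(0, i - half), min(len(feats), i + half + 1)
        out.append(sum(feats[lo:hi]) / (hi - lo))
    return out
```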
Suite-IN++: A FlexiWear BodyNet Integrating Global and Local Motion Features from Apple Suite for Robust Inertial Navigation
Sun, Lan, Xia, Songpengcheng, Yang, Jiarui, Pei, Ling
The proliferation of wearable technology has established multi-device ecosystems comprising smartphones, smartwatches, and headphones as critical enablers for ubiquitous pedestrian localization. However, traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness due to their reliance on a single-device setup. Therefore, a promising solution is to fully leverage existing wearable devices to form a flexiwear bodynet for robust and accurate pedestrian localization. This paper presents Suite-IN++, a deep learning framework for flexiwear bodynet-based pedestrian localization. Suite-IN++ integrates motion data from wearable devices on different body parts, using contrastive learning to separate global and local motion features. It fuses global features based on the data reliability of each device to capture overall motion trends and employs an attention mechanism to uncover cross-device correlations in local features, extracting motion details helpful for accurate localization. To evaluate our method, we construct a real-life flexiwear bodynet dataset, incorporating Apple Suite (iPhone, Apple Watch, and AirPods) across diverse walking modes and device configurations. Experimental results demonstrate that Suite-IN++ achieves superior localization accuracy and robustness, significantly outperforming state-of-the-art models in real-life pedestrian tracking scenarios.
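The reliability-based fusion of global features can be sketched as a weighted average in which each device's contribution is proportional to its estimated data reliability. The reliability scores here are given as inputs; in the paper they would come from the learned model:

```python
def fuse_global(device_feats, reliabilities):
    """Reliability-weighted fusion of per-device global motion features.

    device_feats: one feature vector per device (e.g. iPhone, Apple
    Watch, AirPods); reliabilities: one nonnegative score per device.
    Weights are normalized so the fused feature is a convex combination,
    letting more trustworthy devices dominate the overall motion trend.
    """
    z = sum(reliabilities)
    weights = [r / z for r in reliabilities]
    dim = len(device_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, device_feats))
            for d in range(dim)]
```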
SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing
Hong, Seokhyeon, Kim, Chaelin, Yoon, Serin, Nam, Junghyun, Cha, Sihun, Noh, Junyong
Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available on the project page.
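The cross-attention maps that drive the zero-shot editing can be sketched as follows: each motion token (query) is scored against every word token (key), and each row is softmax-normalized into a distribution over words. Reweighting or swapping a word's column in these maps is the generic hook for attention-based editing; the shapes and the editing rule here are illustrative, not SALAD's exact mechanism:

```python
import math

def cross_attention(motion_q, word_k):
    """Cross-attention maps between motion tokens and word tokens.

    motion_q: list of motion-token query vectors; word_k: list of
    word-token key vectors. Returns one row per motion token, each a
    softmax distribution over words, i.e. how strongly that frame
    attends to each word of the prompt.
    """
    maps = []
    for q in motion_q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in word_k]
        m = max(scores)
        es = [math.exp(s - m) for s in scores]
        z = sum(es)
        maps.append([e / z for e in es])
    return maps
```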